ABSTRACT

A previous study identified 45 item-writing rules for multiple-choice tests
presented in textbooks on educational measurement. The current study
presents a quantitative review of the literature with respect to the
empirical and theoretical evaluation of these principles of item writing.
Fifty-six studies that addressed at least one of the 45 item-writing rules
were identified. Twenty-one of the rules have been studied empirically; 24
item-writing rules have no empirical basis. The optimal number of options
was the most frequently studied rule, with 18 studies cited. The major
generalization from these studies is that three options maximize test
reliability and efficiency. Type-k items were evaluated in eight studies.
Results suggest that, compared to single-answer multiple-choice items,
type-k items are more difficult, provide clues to some examinees, and
decrease test reliability and efficiency. Eight other studies suggest some
empirical basis for keeping the length of the keyed option about the same
as other options. This review suggests that the majority of the common
principles of multiple-choice item writing are not empirically based.
(Author/DWH)

ABSTRACT

In a previous study the authors identified 45 item-writing rules for
multiple-choice tests, presented by authors of textbooks in educational
measurement. The current study reports a quantitative review of the
literature with respect to the empirical and theoretical evaluation of these
principles of item writing.

Fifty-six studies that addressed at least one of the 45 item-writing
rules were identified. Twenty-one (47%) of the rules have been studied
empirically; twenty-four (53%) item-writing rules have no empirical basis.

The optimal number of options was the most frequently studied rule, with
18 studies cited. The major generalization from these studies is that three
options maximize test reliability and efficiency.

Type-k items were evaluated in eight studies. Results suggest that,
compared to single-answer multiple-choice items, type-k items are more
difficult, provide clues to some examinees, and decrease test reliability and
efficiency. Eight other studies suggest some empirical basis for keeping the
length of the keyed option about the same as other options. All other rules
had six or fewer studies.

This review suggests that the majority of the common principles of
multiple-choice item writing are not empirically based. Current item-writing
practices remain more art than science.

A paper presented at the annual meeting of the American Educational Research
Association, Chicago, IL, April 1985.



Most multiple-choice item writers receive initial instruction from any
number of textbooks that deal with educational or psychological testing. The
sum of knowledge about multiple-choice item writing is not found in any single
reference but exists as lore passed down from generation to generation through
these textbooks. Despite many advances in test theory in recent years, such
as generalizability theory (Brennan, 1983) and item response theory (Lord,
1980), item writing has not yet advanced far as a science, although a number
of theories of item writing have been proposed (Bormuth, 1970; Roid and
Haladyna, 1982).



The present study is the second in a series of studies concerning
multiple-choice item-writing practices. The objective in the first study
(Haladyna & Downing, 1984) was to examine these textbooks and identify the
core of knowledge about multiple-choice item writing. The objective in this
second study is to examine the research base that supports item-writing
practices as promulgated in these textbooks. The studies date from 1925 to
1984 and span a wide variety of test content, educational levels, test types,
and, of course, item-writing practices. Quantitative methods were used in an
effort to synthesize the results found in the studies. Before these are
discussed, however, the Haladyna and Downing (1984) study will be reviewed as
a means of presenting the basis for the present research.

An Analysis of Knowledge About Multiple-Choice Item Writing

Thirty-five textbooks that represent a wide range of perspectives and
periods in educational testing have been identified (see Appendix A).
Instructional statements were identified in these textbooks and organized by
six fundamental categories: (1) general item-writing advice on content
concerns, (2) general item-writing advice on construction, (3) item advice
focusing on stem construction, (4) general advice focusing on option
construction, (5) advice focusing on construction of the correct option, and
(6) advice focusing on the construction of the distracters (incorrect
options).

The researchers identified which passages in each textbook discussed
multiple-choice item writing, and classified all of the instructional
statements contained in these passages. It was possible to construct an
author-by-rule matrix and observe the number of instructional statements made
by each textbook author, the frequency of occurrence of each rule across all
textbooks, and the number of different rules that existed in the textbook
literature.
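
The three summaries drawn from an author-by-rule matrix can be sketched in a
few lines of code. The sketch below is only an illustration: the author
labels, rule labels, and cell values are invented, and the function names are
hypothetical rather than anything used in the original study. A cell of 1
means that the author's textbook states the rule.

```python
# Hypothetical author-by-rule matrix: rows are textbook authors, columns are
# rules. A 1 means the textbook states the rule. All values are invented.
authors = ["Author A", "Author B", "Author C"]
rules = ["Rule 1", "Rule 2", "Rule 3", "Rule 4"]
matrix = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 1, 0],
]


def statements_per_author(matrix):
    """Number of rules stated by each author (row sums)."""
    return [sum(row) for row in matrix]


def rule_frequencies(matrix):
    """Number of authors stating each rule (column sums)."""
    return [sum(column) for column in zip(*matrix)]


def distinct_rules_cited(matrix):
    """Number of rules stated by at least one author."""
    return sum(1 for frequency in rule_frequencies(matrix) if frequency > 0)


def percent_of_rules_cited(matrix):
    """Percentage of all rules in the matrix that each author states."""
    n_rules = len(matrix[0])
    return [100.0 * count / n_rules for count in statements_per_author(matrix)]
```

The last function corresponds to the kind of per-author percentage reported
for the textbook authors, computed here on the invented matrix only.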

Initially, 50 rules were identified. Upon closer examination, the list
was refined to 45 rules, and these formed the basis for the present study.
Table 1 summarizes the 45 rules according to the six categories previously
discussed.

Of these rules, 14 were identified as appearing most frequently. Three
of these rules could be considered general advice, eight as advice on option
construction, and the remaining three suggestions on distracter construction.
Interestingly, many of the most frequently used rules are the kind that are
empirically testable (e.g., avoid the use of "none of the above"), rather than
the type of rule that is based largely on common sense and is not easily
empirically testable (e.g., "avoid items based on opinions" or "make a good
transition from stem to option").

With respect to the frequency with which these rules are cited, Ebel
(1979) led all other textbook authors. However, he cited only 58% of all
rules. For other authors, this percentage of citation of all rules ranged
downward to 20%.



Finally, the researchers identified the number of citations to research
on multiple-choice item writing that each textbook contained. The number of
citations ranged from zero to 24, with a median of 2.5.

Thus it would appear that a body of knowledge does exist for
multiple-choice item writing. Most authors, however, do not appear to use the
research literature to substantiate their advice. They may instead depend on
what they have learned through courses they have taken, experiences in item
writing, and other sources.

The present research study was a natural consequence of the first. The
objective of this study, as mentioned earlier in this paper, was to explore
the research basis for these instructional statements on item writing. More
specifically, three questions were addressed in the present study:
     1. How many studies deal with these item-writing rules?
     2. Which of the item-writing rules have been most often studied?
     3. For the rules that have been most often studied, what
        conclusions can be drawn regarding their validity?

METHOD

Design of the Study

This review is quantitative in the sense that the number of studies
reported in the literature and the frequency with which item-writing rules
have been studied are its central foci. Further, results were evaluated in
terms of ratings of effects rather than by other more subjective methods.
This procedure is a middle ground between the more traditional review
procedure and meta-analysis. The former type of research method is flawed by
the problem of subjectivity. The latter requires a large number of studies
and data that can be aggregated, neither of which could be obtained for this
review. It was not possible, therefore, to use meta-analysis techniques.

Search Procedures

The search for research studies dealing with any of the 45 item-writing
rules began with a computerized literature search on the topic "item
writing." Each of the papers identified was reviewed and was either accepted
or rejected for further consideration. Many papers dealt with theoretical
approaches to item writing, such as those found in Roid and Haladyna (1982),
and these were eliminated because these item-writing rules were not the
concern of this review. References from those papers included in this study
were examined for leads to other studies. This process assured that most
relevant research was identified and included in the present study.

Method for Classifying Studies

A coding sheet was used to classify each study. The types of information
coded included (1) sample size, (2) test length, (3) type of test
(i.e., standardized achievement, classroom achievement, or aptitude-ability),
(4) approximate educational level of the examinees, (5) rules studied,
(6) methodological problems, and (7) a rating for each criterion involving
each rule.

Results were evaluated on the basis of six criteria typically used in
these studies: (1) item difficulty, (2) item discrimination, (3) reliability,
(4) validity, (5) efficiency (the time it takes to complete a test), and (6)
test score variance.

Both authors of the present study validated the rating form by
individually rating five studies and comparing ratings. The findings showed
concurrence, so the balance of these papers were divided for review. In the
course of synthesizing these studies, all studies were reviewed again, and
discrepancies in classification were resolved through mutual agreement.

To answer the study question about how many studies address item-writing
rules, the number of papers rated was counted. This simply provided an
overall measure of what kind of attention item-writing research has received
in the empirical literature.

The second question dealt with the frequency with which each item-writing
rule had been studied. A frequency distribution was created for the 45 rules.
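
A frequency distribution of this kind can be tallied directly from the coded
studies. The sketch below assumes, purely for illustration, that each coded
study is reduced to the list of rule numbers it addressed; the study labels
and the data are invented, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical coded studies: each entry lists the rule numbers a study
# addressed. The labels and values are invented for illustration.
coded_studies = {
    "study_a": [26],
    "study_b": [26, 29],
    "study_c": [29, 30],
    "study_d": [26],
}


def rule_frequency_distribution(coded_studies, n_rules=45):
    """Count how many studies addressed each rule, including zero counts."""
    counts = Counter()
    for rules in coded_studies.values():
        counts.update(set(rules))  # count a rule at most once per study
    return {rule: counts.get(rule, 0) for rule in range(1, n_rules + 1)}


def rules_never_studied(distribution):
    """Rule numbers that no coded study addressed."""
    return [rule for rule, n in distribution.items() if n == 0]


distribution = rule_frequency_distribution(coded_studies)
```

Keeping the zero counts in the distribution makes the rules with no empirical
study visible alongside the frequently studied ones.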
It was more difficult to draw conclusions about the validity of rules,
which was the point of the third research question. All rules with a
frequency of two or more studies were subjected to additional review to
determine if any consensus could be found among the studies. The intent was
to discover if the rule had analytical or theoretical support as well as
empirical support.

For some rules, it was possible to synthesize all studies that discussed the

RESULTS AND DISCUSSION

To answer the first question of this study, 56 studies were identified
that addressed at least one of the 45 item-writing rules. These studies
varied widely with respect to types of tests, test lengths, sample sizes, and
educational levels of samples. All of these studies were published between
1925 and 1984. As the availability of computers improved in the 1970s and
1980s, the method of statistical analysis changed significantly.
Nonetheless, some of the best designed and most comprehensive studies were
completed in the 1920s.

Table 2 provides the frequency distribution of the rules studied most
often. As shown there, only the rule dealing with the optimal number of
options has received major attention, while five other rules have received
moderate attention. All other rules were cited four or fewer times; seven
rules received only one citation, while 24 rules were not cited at all. The
balance of this section will be devoted to discussions of the research on the
most frequently studied rules.

Rule 26: Use three, four, or five options for an item.

Studies of the ideal number of options can be divided into two discrete
groups: (a) theoretical and analytical, or (b) empirical. Each will be
discussed in turn.

The earliest study of option number was by Lord (1944), who developed a
formula for predicting changes in reliability as a function of the number of
options added to a multiple-choice item. Lord's data suggest that
three-option items are optimal. Tversky (1964) developed three criteria
(discriminability, power, and information of a test) to evaluate the number of
choices in a multiple-choice item. He concluded that (1964, p. 390):

     Whenever the amount of time spent on the test is proportional
     to its total number of alternatives, the use of three
     alternatives at each choice point will maximize the
     amount of information obtained per time unit.

This finding has been supported in subsequent studies by Ebel (1969),
Grier (1975; 1976), and by Lord (1977). Lord's study (1977) is most
informative about the point at which three-option test items are most
effective. Using item response theory, Lord presents item efficiency curves
to show that the three-option item provides maximum information in the
mid-range of the score scale, while the true-false item provides most
information for high-scoring examinees and the four-option or five-option item
provides the most information for low-scoring examinees. This is an
interesting and important observation that takes into account the prominence
of guessing among low-scoring examinees. (And, of course, the four-option or
five-option item offers more protection against guessing.) Because
high-scoring examinees are less likely to guess, two options are sufficient
for them. For most examinees, providing three options appears to be optimal.
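
The flavor of the conclusion quoted from Tversky (1964) can be illustrated
with simple arithmetic. The sketch below rests on simplifying assumptions
that are not taken from the studies reviewed: an item with k options is
assumed to convey at most log2(k) bits, and the time spent on an item is
assumed to be proportional to k. It is not a reconstruction of Tversky's
derivation, and the function names are hypothetical.

```python
import math


def information_per_time_unit(k):
    """Upper bound on bits per option read, assuming time is proportional to k."""
    return math.log2(k) / k


def best_option_count(candidates=range(2, 6)):
    """Option count among the candidates with the highest bits per time unit."""
    return max(candidates, key=information_per_time_unit)


# Ratio for two to five options under the stated assumptions.
ratios = {k: information_per_time_unit(k) for k in range(2, 6)}
```

Under these assumptions the ratio is largest at three options, is equal for
two and four options, and is lowest for five.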

Test reliability is the only index of overall test quality compared in
all empirically based studies on the optimal number of options for
multiple-choice items. Other characteristics of items and tests compared in
these studies were item difficulty and discrimination, validity, test score
variance, and efficiency.



Reliability. Table 3 presents reliability coefficients for the ten
studies that report reliability coefficients for tests with various numbers of
options. These reliabilities are computed by different formulas under the
conditions of various test lengths, sample sizes, educational levels, and test
content.

Table 3 shows that reliability is, in general, a monotonically increasing
function of the number of options, but that the incremental gain in
reliability is small when more than three options are used. The authors of
these studies conclude that when efficiency is taken into account (the extra
effort needed to create additional options, and the extra time needed for
students to respond to longer items), either three or four options maximize
reliability.

Efficiency. Several studies examine efficiency for various numbers of
options. Efficiency is variously defined from "absolute time to complete" to
the relative efficiency of information bits gained per unit time. For
example, Williams and Ebel (1957) conclude that two or three options are most
efficient, but in an earlier study, Ruch and Stoddard (1925) state that two or
five options maximize efficiency. In general, these studies conclude that
three or four options are most efficient.

Item difficulty and discrimination. Several studies report item
difficulty and discrimination, and the results are mixed and contradictory.
For example, Charles (1926) reports that five-option items are the most
difficult and two-option items are the least difficult. Costin (1972) reports
no differences in item difficulty or discrimination between three and four
options. Parker and Somers (1983) show no differences in difficulty between
four-option and five-option items. Straton and Catts (1980) compare two,
three, and four options and report that three-option and four-option items are
nearly equal in difficulty, but that three-option items discriminate better
than four-option items.

In summary, the relationship of the number of options to test reliability
is the most frequently studied item-writing practice. In general, test
reliability is shown to vary directly with the number of options from two to
five. However, the incremental gain in test reliability when a fourth or
fifth option is added is very small. In general, these studies show that test
efficiency is maximized by three or four options. Item difficulty and
discrimination show mixed results for two to five options and no conclusions
are warranted. Validity was studied in only one study (Ruch & Stoddard,
1925), which showed that five options increased criterion-related validity.

Rule 19: Avoid type-k items.

Of the studies involving type-k items, all but one involved comparisons
with the type-x (multiple true-false) format. Therefore, comparisons with
conventional multiple-choice were limited to two studies, but the other
studies, involving the x-type items, provide additional insights about the
type-k.

Parker and Somers (1983) compared the type-k format with four-option and
five-option multiple-choice items, and found the type-k format more difficult
as well as less reliable than the other two formats. Hughes and Trimble
(1965) compared a precursor of the type-k format, where the option "both are
correct" is used. Their findings indicated higher reliability for the "both"
option as well as higher variance of test scores. Difficulty was
unaffected. In a replication, they found that the "both" option, when
compared to a conventional format, increased reliability and variance and also
produced more difficult items. A second replication yielded results similar
to those of the first replication: more difficulty and greater reliability.
No effect on item discrimination was detected. This contradicts the finding
of higher reliability for this format, because item discrimination and
reliability are functionally related.

The results of the studies involving type-k and type-x items are somewhat
mixed. Further, this research is somewhat confounded because scoring systems
for type-x items vary significantly, and because the chance levels for type-k
and type-x items are not the same, which makes test scale comparisons somewhat
problematic.

Regarding item difficulty, the results of the studies are mixed, perhaps
owing to the variety of scoring methods for type-x formats. Albanese, Kent,
and Whitney (1979), Harasym, Norris, and Lorscheider (1980), and Kolstad,
Briggs, Bryant, and Kolstad (1983) report the type-k format produced easier
items, while Albanese, Kent, and Whitney (1977) found the opposite. None of
these studies addressed item discrimination. Three studies (Albanese et al.,
1977; Albanese et al., 1979; and Harasym et al., 1980) all reported lower
reliability with the type-k format, while only Hill and Woods (1976) report no
differences between type-k and type-x formats. With respect to validity, only
two studies (Hill & Woods, 1976; Albanese et al., 1979) reported no
differences.



In summarizing these results, it must be noted that the paucity of
studies comparing type-k with conventional multiple-choice items is a serious
limitation. However, these studies present some strong arguments against
type-k formats.

1. In most circumstances, this format seems to produce more difficult
items. Although increased difficulty need not be a problem, it can be if not
taken into account when a test with both type-k items and items that have
other formats is assembled.

2. The suspicion that type-k items provide clues is shared by Harasym et
al. (1980) and Albanese (1982), who offer evidence in support of this
belief. It appears that, unlike knowledge about the truth of a primary
option, knowledge about the falsity of an option helps eliminate the secondary
choices in the type-k format. It therefore seems very possible that type-k
items help clue examinees, particularly low-scoring ones. However, this needs
to be more extensively studied.

3. It is clear that type-k items are less reliable in most instances.

4. Perhaps the most compelling reason for rejecting the type-k item is
that it is more inefficient to construct and more laborious to read. More of
the conventional multiple-choice items than the type-k items can be given per
unit of time.

5. Finally, the finding of Hill and Woods (1976) that students prefer
x-type over k-type cannot be ignored. Although hardly a sufficient condition,
face validity is certainly necessary in the choice of a test format.

Rule 29: Keep the length of options fairly consistent.

Eight studies concerning the effect of presenting the keyed option as the
longest alternative were reviewed. All of these studies evaluated the effect
of the key being the longest option on item difficulty, while some studies
also evaluated the effect of this flaw on item discrimination, test
reliability, and concurrent validity.

Board and Whitney (1972) found that length of keyed options made no
overall difference in test difficulty. However, they also found that less
able students tended to use the clue of longer keyed options more than abler
students. Both test reliability and concurrent validity were decreased by the
length flaw. In the design of this study, course final examination score was
used as a blocking variable.

Chase (1964) also found that the length of the correct option had no
effect on difficulty, but concluded that the response set to select the
longest option interacts with item difficulty. For more difficult items,
then, students tend to use the length clue, but for easier items they do not.

All other studies reviewed concluded that the length flaw produced easier
items. Jones and Kaufman (1975), in a study of response set, found that
higher-scoring students use the length clue more than do lower-scoring
students. An internal total test score criterion was used to block high and
low-scoring students in this study.

Evans (1984) and Strang (1977) found longer keyed options to be easier
than shorter keyed options. Dunn and Goldstein (1959) and McMorris et al.
(1972) concluded that the length clue made items easier, but had no effect on
reliability and validity. Weiten (1984) also found the longer keyed
alternative to be easier, but there was no effect on item discrimination, test
reliability, or validity.

In summary, most studies conclude that the use of long correct options
makes items easier. In the only two studies that note no such effect, student
ability was used as a blocking variable with contradictory results. The
difference in measures of student ability used in these two studies may
account for the contradictory findings. This item-writing flaw lowers test
reliability and concurrent validity in only one of the eight studies reviewed.

Rule 30: Avoid the use of "none of the above".

A total of six studies that discussed the use of "none of the above" were
reviewed. These studies examined the effect of this option on item difficulty
and discrimination and on test reliability and validity.

Schmeiser and Whitney (1975b), in an extensive study of the use of the
"none of the above" option, found that the effect on difficulty and
discrimination on tests of different subject matter was mixed. According to
their findings, test reliability and validity were slightly decreased by the
use of this option.

Wesman and Bennett (1946) observed no effect on item difficulty and a
mixed result on item discrimination. The data from this study are, however,
difficult to interpret. A mixed effect on test difficulty and item
discrimination was also reported by Williamson and Hopkins (1967). However,
examination reliability was lower for this type of option.

Studies by Dudycha and Carpenter (1973), Hughes and Trimble (1965), and
Rimland (1960b) concluded that "none of the above" increased item
difficulty. Dudycha and Carpenter (1973) and Hughes and Trimble (1965) also
found lower test reliability, but Rimland (1960b) did not evaluate the effect
of this practice on reliability.

In summary, no conclusion can be reached about the effect of "none of the
above" on item difficulty. Three studies found that this option increased
item difficulty, but three studies reported no effect or mixed results. Three
studies found that the use of this option lowers test reliability. There is
no consensus about the effect on item discrimination. Only one study
evaluated concurrent validity effects (Schmeiser & Whitney, 1975b), and it
found validity slightly decreased by use of "none of the above."

Rule 17: State the stem in either a question form or a sentence form.

This rule has received moderate attention in the research literature.
Six studies are cited (Board & Whitney, 1972; Dudycha & Carpenter, 1973; Dunn
& Goldstein, 1959; Schrock & Mueller, 1982; Schmeiser & Whitney, 1975a;
1975b). As in most other instances, the test lengths, test types, educational
levels of examinees, test content, and other factors vary significantly across
these papers. Despite this variability, some definite trends in findings
about this rule can be reported. In four of the six studies, incomplete stems
were found to be more difficult. While the practical magnitude of this
difference is small, it could affect test assembly, because a preponderance of
incomplete stems will produce a systematically more difficult test.

Typically, discrimination was unaffected by incompleteness of the stem.
Reliability and validity appear to be slightly affected, but this result may
be due to the way in which discrimination was calculated. When the upper 27%
or lower 27% index is used instead of the point-biserial, discrimination may
not be accurately estimated, since the former is only an approximation of the
more desirable latter. Since discrimination and reliability are functionally
related, significant differences in discrimination logically lead to
significant differences in reliability.
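
The difference between the two discrimination indices can be made concrete
with a short sketch. The data are invented (ten examinees, one item), and the
function names are hypothetical; the sketch computes the point-biserial as the
Pearson correlation between the 0/1 item score and the total score, and the
upper-lower index as the difference in proportion correct between the top and
bottom 27% of examinees ranked by total score. It assumes that both the item
scores and the total scores show some variation.

```python
import math


def point_biserial(item, totals):
    """Pearson correlation between a 0/1 item score and the total score."""
    n = len(item)
    mean_item = sum(item) / n
    mean_total = sum(totals) / n
    covariance = sum(
        (i - mean_item) * (t - mean_total) for i, t in zip(item, totals)
    ) / n
    sd_item = math.sqrt(sum((i - mean_item) ** 2 for i in item) / n)
    sd_total = math.sqrt(sum((t - mean_total) ** 2 for t in totals) / n)
    return covariance / (sd_item * sd_total)


def upper_lower_index(item, totals, fraction=0.27):
    """Proportion correct in the top group minus that in the bottom group."""
    group_size = max(1, round(fraction * len(item)))
    order = sorted(range(len(item)), key=lambda idx: totals[idx])
    lower = order[:group_size]
    upper = order[-group_size:]
    p_upper = sum(item[idx] for idx in upper) / group_size
    p_lower = sum(item[idx] for idx in lower) / group_size
    return p_upper - p_lower


# Invented data: item score (0 or 1) and total test score for ten examinees.
item_scores = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
total_scores = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

r_pb = point_biserial(item_scores, total_scores)
d_index = upper_lower_index(item_scores, total_scores)
```

Because the upper-lower index ignores the middle of the score distribution,
the two values need not agree on the same data, which is the concern raised
above.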

Thus, based on this limited set of studies, it is possible to draw the
preliminary conclusion that the incomplete stem is a less effective
item-writing strategy than the question format. While the differences between
the two stem types are slight, the replication of findings builds support for
this conclusion in the absence of further studies.

Rule 3?: Balance the key; that is, make sure the correct option is found an
equal number of times in each option position, if possible.

Six studies concerning key balancing were reviewed in the present
research. The results of these studies were mixed. Four studies reported
that the position of the keyed response affected item difficulty and two
studies reported the opposite.

Ace and Dawis (1973), Jones and Kaufman (1975), Evans (1984), and
McNamara and Weitzman (1943) report that the position of the key is related
to item difficulty. Ace and Dawis (1973) found that the fifth position for
the keyed response was the most difficult for examinees and the third position
was next most difficult.

Both Marcus (1963) and Wilber (1966) report that there is no evidence of
a positional response set or a relationship between position of the key and
difficulty.
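
Whether a key is balanced in the sense of this rule can be checked
mechanically. The sketch below is an illustration only: the answer key is
invented, the function names are hypothetical, and the tolerance of one item
between the most and least used positions is an assumption rather than a
standard taken from the studies reviewed.

```python
from collections import Counter


def key_position_counts(answer_key, n_options):
    """Number of items keyed at each option position (positions start at 1)."""
    counts = Counter(answer_key)
    return [counts.get(position, 0) for position in range(1, n_options + 1)]


def is_balanced(answer_key, n_options, tolerance=1):
    """True if position counts differ by no more than the assumed tolerance."""
    counts = key_position_counts(answer_key, n_options)
    return max(counts) - min(counts) <= tolerance


# Invented 12-item answer key with four options per item.
example_key = [1, 3, 2, 4, 2, 1, 4, 3, 3, 1, 2, 4]
position_counts = key_position_counts(example_key, 4)
```

A key that fails the check can be rebalanced by reordering options within
items before the test is assembled.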

Rule 20: Don't clue through grammatical errors. 

This rule refers to the inadvertent use of incorrect grammar to clue
examinees to the correct option. Only four studies can be reported which have
studied the validity of this rule.

Evans (1984) reported that grammatical cluing made items easier and
increased the variability of the test scores. McMorris, Brown, Snyder, and
Pruzek (1972) found that this fault made items easier, but no effect was noted
on reliability or validity. Weiten (1984) found that difficulty and
discrimination were not affected by grammatical inconsistency. Interestingly,
the results for reliability were mixed, and the results for validity were
inconclusive. Huntley and Plake (1980) found no support for cluing through
grammatical inconsistency.

Nevertheless, it seems sensible to avoid grammatical error, just to
support the face validity of the test. In the absence of more conclusive
empirical evidence, the rule should stand on the grounds that grammatical
clues detract from face validity.

Rule 16: Avoid window dressing in the stem.

The effect of window dressing (extraneous material in the stem of the
item) was investigated in four of the studies reviewed.

Rimland (1960a) found that window dressing decreased test reliability,
discrimination, variance, and concurrent validity. Schrock and Mueller (1982)
reported that window dressing made test items more difficult and took students
longer to complete than items without window dressing. However, this item
flaw did not affect test reliability.

Board and Whitney (1972) found that less able students performed better
on items with window dressing than more able students. There was no overall
effect on mean test difficulty, but a decrease in test reliability was
reported. However, Schmeiser and Whitney (1975a) reported little or no effect
of window dressing on item difficulty or test reliability.

In summary, three of the four studies reviewed suggest that window
dressing has an adverse effect on at least some students.

Rule 24: Don*t leave blanks in the middle or ttie stem* * 

, — ■■ - ■ ... — — # 

t- 

This rule is similar to the rule about using a question stem rather than
a stem that is an incomplete statement. The rule arose from verbal analogy
items, so it has limited applicability, but the effect that leaving blanks in
the middle of the stem sentence has on item and test characteristics may be of
interest. Silverstein and McClain (1963) were among the first to examine the
effect of blanks in items, although they allude to a study by Campbell (1961)
in which a design flaw makes the results questionable. Silverstein and
McClain (1963) found no effects when the blank was systematically varied in
the stem. Ace and Dawis (1973) describe the dispute between Campbell and his
adversaries and offer partial support for both sides. Changes in the
structure of the analogy did not change difficulty, but the interaction of
this change and the position of the correct response did appear to affect
difficulty.

Schrock and Mueller (1982) offer the only study that addresses this rule
as it applies to items not based on analogies. Their findings seem to suggest
virtually no effect on difficulty, test score variance, or response time, but
mixed results were reported for reliability.

The findings from these reports suggest that this rule is still strongly
in need of further study. However, there does not seem to be any harm in
leaving a blank in the middle of the stem. Until more evidence is marshalled
to support it, the rule appears to have questionable validity.

Rule 38: Use plausible distracters. 

This rule, like several others, appears to be based on common sense;
empirical testing seems hardly necessary. Yet three very different studies
discuss its applicability.

The first of these, by Weiten (1984), compared plausible and implausible
options, and found that flawed items were less difficult, but not less
discriminating. No differences were observed for reliability or validity,
since the variance of test scores was maintained so that the testwiseness
clues in these implausible distracters assisted all ability levels of
examinees equally.

Smith (1982) used a very small sample of students and items to examine
the tendency for students to determine the right answer by using a teachable
strategy. Smith concludes that test-taking may be a learned skill, and that
learning the skill may affect test scores. If distracters are written as
variations of correct answers, as Smith contends, then convergence theory may
explain the development of testwiseness and may indicate that test scores are
artificially inflated if distracters are plausible.

The third study, by Owens, Hanna, and Coppedge (1970), compares three
methods of generating or selecting distracters: the judgmental method, the
frequency method, and the discrimination method. The judgmental method, in
which the item writer invents the most plausible distracters, was directly
compared to the frequency method, in which the actual responses that students
made to open-ended questions were tallied and those written most frequently
were used in subsequent multiple-choice tests. Results were mixed, at least
with respect to reliability and validity. Difficulty and test score variation
did not seem to be affected.
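
The tallying step of the frequency method can be illustrated with a short sketch. This is not the procedure of Owens et al. (1970) in detail; the function name, the normalization of responses (trimming and lowercasing), and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def frequency_distracters(responses, key, k=3):
    """Return the k most frequent incorrect open-ended responses.

    `responses` are hypothetical free-text answers to a completion item,
    `key` is the correct answer, and `k` is the number of distracters wanted.
    """
    # Assumed normalization: ignore case and surrounding whitespace.
    normalized = [r.strip().lower() for r in responses]
    correct = key.strip().lower()
    wrong = [r for r in normalized if r and r != correct]
    # Tally wrong answers and keep the most frequently written ones.
    return [answer for answer, _count in Counter(wrong).most_common(k)]


# Hypothetical responses to "The capital of France is ____."
responses = ["Paris", "Lyon", "lyon", "Marseille", "Paris",
             "Nice", "Lyon ", "Marseille", "Toulouse"]
distracters = frequency_distracters(responses, key="Paris", k=2)
print(distracters)
```

With these invented responses, the two most frequently written errors become the candidate distracters for the multiple-choice version of the item.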


On the surface, it seems obvious that implausible distracters are not
desirable. Yet two of the three studies provide compelling evidence that
plausible distracters may be more easily eliminated by testwise examinees.
The study by Owens et al. (1970) suggests that distracters for a test should
be field tested and that the distracters should be chosen because they have
negative discrimination and negative item characteristic curves.
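
One common way to check whether a field-tested distracter has negative discrimination is the point-biserial correlation between choosing that option and the total test score. The sketch below assumes this index; the studies reviewed may have used other discrimination statistics, and the function name and data are hypothetical.

```python
from statistics import mean, pstdev


def option_discrimination(choices, totals, option):
    """Point-biserial correlation between selecting `option` and total score.

    `choices` holds each examinee's selected option for one item and
    `totals` holds the same examinees' total test scores.
    Returns None when the index is undefined.
    """
    chose = [c == option for c in choices]
    p = sum(chose) / len(chose)          # proportion selecting the option
    sd = pstdev(totals)                  # population SD of total scores
    if p == 0.0 or p == 1.0 or sd == 0.0:
        return None
    mean_chose = mean([t for c, t in zip(chose, totals) if c])
    mean_other = mean([t for c, t in zip(chose, totals) if not c])
    return (mean_chose - mean_other) / sd * (p * (1 - p)) ** 0.5


# Hypothetical field-test data for one item whose keyed answer is "A".
choices = ["A", "A", "B", "C", "A", "B", "A", "C"]
totals = [9, 8, 3, 4, 7, 2, 10, 5]
for opt in ("A", "B", "C"):
    print(opt, round(option_discrimination(choices, totals, opt), 3))
```

In this invented data set the keyed option correlates positively with total score and both distracters correlate negatively, which is the pattern the paragraph above recommends selecting for. In practice the total score is often computed without the item under study.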

Rule 40: Don't use distracters that clue testwise examinees.

Sarnecki (1979) has presented a very complete analysis of testwiseness.
Testwiseness is an examinee characteristic and thus outside the scope of this
review; however, some elements of item writing are influenced by
testwiseness. Only three studies that discuss cluing answers by violating
item-writing principles other than the rule of grammatical consistency were
identified.

Each of the three studies focuses on the repeating of a word or phrase from
the stem in the correct option, which is a testwiseness clue. McMorris et al.
(1972) and Weiten (1984) found that only item difficulty is affected by such
cluing. While Pyrczak (1973) did not replicate these findings, he did find
that testwiseness could be taught and that some students could increase scores
after training.

Despite the similarity in the design of these three studies, the rule
discussed appears logically sound. The use of specific determiners (e.g.,
"always" and "never"), the use of cognates in the stem and correct option, and
the use of ridiculous options should provide unfair advantage to testwise
students. Thus the rule should be supported on the grounds of prudence, but
should be interpreted in light of the findings of Smith (1982) and Weiten
(1984), discussed for Rule 38.

Rule 37: Use common errors of students for distracters.

This rule was briefly mentioned in conjunction with the study by Owens et
al. (1970) supporting Rule 38. Their method for generating distracters was to
use a completion format and have students respond. The errors that appeared
most frequently were the bases for constructing distracters, and produced good
results according to their study. To take this principle a step further,
student errors might also be evaluated in terms of their discriminating power;
such distracters should have negative discrimination and negative item
characteristic curves. The study by Powell and Isbister (1974) is unique in
this review and worthy of more extensive attention. It examined the response
patterns inherent in correct and incorrect answers, and challenged the
assumption that no useful information is available in wrong answers. This work
suggests an interesting proposition that has received recent attention in
other, more theoretical discussion of item writing: that items should have
diagnostic distracters that provide information that not only increases test
reliability (Haladyna, 1984) but also permits diagnostic instruction (Otoid,
1984).

Rule 31: Avoid the use of "all of the above."

While most textbook authors recommend against using the "all of the
above" option, only two studies can be reported here that address this rule.
Dudycha and Carpenter (1973) report that use of "all of the above" makes items
more difficult and less discriminating. Hughes and Trimble (1965) report that
items that use the option "both of the above," described earlier as a
precursor to the type-k format, are more difficult, but that both variance and
reliability appear to be increased. These findings contradict those of
Dudycha and Carpenter (1973).

In light of this disagreement, it is difficult to evaluate this rule.
Authors and teachers are cautioned against recommending such item-writing
practices without more experience or data to support such a rule.

Rule 32: Use the option "I don't know."

This option is intended to reduce the incidence of examinee guessing.
Sanderson (1973) examined "don't know" in a clinical education setting and
found that there was a slight distortion of scores by those using this
option. Sherman (1974) examined National Assessment of Educational Progress
data and found differences according to age, region, ethnic background, and
even personality, in response patterns to this option. These findings are
particularly impressive since these data are a national probability sample
representing the entire United States.

Although only two studies have examined this rule, the evidence appears
overwhelmingly in favor of rejecting its validity. Although it is meant to
reduce guessing, guessing is confounded and testwiseness is rewarded. It is
thus difficult to justify the use of an "I don't know" option.

Other Findings

Seven other rules received only one citation in an empirical study. The
findings are presented in Table 4. In this table, the author and rule are
identified and a box score is used to determine whether the rule is supported
on various grounds such as difficulty, discrimination, reliability, and
validity. These findings are presented here for completeness but are not
discussed further because only one study has been identified for each rule.
The remaining 24 rules received no attention.

CONCLUSIONS 

Only 56 studies were found to bear directly on the validity of 45
item-writing rules: testimony enough that there has not been sufficient
research to support most of them, although common sense and face validity
suggest that many of the rules are legitimate.

The frequency with which rules have been empirically tested is directly
related to the number of studies. Many of these studies address more than one
rule, but few rules have been studied more than four times, and many rules are
substantiated by little or no empirical research.

The optimal number of options that a multiple-choice item should have has
received considerable attention. Empirical research supports theoretical
study in indicating that three options achieve the optimal balance between
reliability and efficiency. It is surprising, considering the evidence, that
virtually all authors of textbooks favor multiple-choice items with four or
five options and that nearly all standardized tests use more than three
options.

The other rules do not have a firm foundation in research. Further study
of the validity of item-writing rules is necessary. The paragraphs that
follow suggest some fruitful areas for exploration.

1. It is desirable to find methods to improve the development of items
that measure higher-level thinking. Few of the proposals in textbooks and
other sources (e.g., Miller, Williams, and Haladyna, 1978) have met with
practical success.

2. The research to date, particularly that of Lord (1977), indicates
that item performance improves when distracters have negative item
characteristic curves (i.e., negative discrimination). Ideally, distracters
should have diagnostic value. When a student selects a distracter, some
valuable corrective teaching should be possible. A procedure like
answer-until-correct is a step in this direction and may prove to be a
rewarding area for research.

3. The large number of rules yet unstudied provides a source for future
research. Item writers need to know the merits or demerits of the "all"
option, the "none" option, and the "don't know" option.

Methodological Concerns

Many studies reviewed for this paper are flawed. Further studies on
item-writing rules must, to be of value, have a sound experimental design.
Each of the factors under consideration must be well defined and completely
tested via main effects and interactions. The samples of items and examinees
must be sufficient to maintain a reasonable power for statistical tests.

Item difficulty and test difficulty have been vastly overemphasized in
studies of item-writing rules. The effects of an item-writing practice on
discrimination, reliability, validity, and efficiency are much more important
and merit more attention. IRT methods may provide important insights into the
effects of item-writing practices on test characteristics.

It is imperative that studies report the basic data used for analysis.
Means and standard deviations are vital if the results of the study are to be
properly interpreted.

Statistical tests are only the beginning of an analysis. The researcher
should routinely report the effect size so that a standard can be used to
evaluate a result. For large samples, virtually the smallest, most trivial
difference is statistically significant. For small samples, a very large
difference may be statistically insignificant.
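
The contrast between statistical significance and effect size can be made concrete with a small sketch. It assumes a two-group comparison summarized by means and standard deviations, uses Cohen's d as the effect size and the pooled-variance t statistic as the test; the function name and all numbers are invented for illustration and do not come from any study reviewed here.

```python
import math


def effect_and_t(mean1, mean2, sd1, sd2, n1, n2):
    """Return (Cohen's d, pooled-variance t statistic) from summary data."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    d = (mean1 - mean2) / math.sqrt(pooled_var)
    # t equals d scaled by a factor that grows with sample size.
    t = d * math.sqrt(n1 * n2 / (n1 + n2))
    return d, t


# Hypothetical large study: a trivial half-point difference, 20,000 per group.
d_large, t_large = effect_and_t(50.5, 50.0, 10.0, 10.0, 20000, 20000)

# Hypothetical small study: an eight-point difference, 5 per group.
d_small, t_small = effect_and_t(58.0, 50.0, 10.0, 10.0, 5, 5)

print(f"large samples: d = {d_large:.2f}, t = {t_large:.2f}")
print(f"small samples: d = {d_small:.2f}, t = {t_small:.2f}")
```

Under these assumptions the large study yields d = 0.05 with t of about 5.0, well beyond the two-tailed .05 critical value of roughly 1.96, while the small study yields d = 0.80 with t of about 1.26, short of the critical value of 2.306 for 8 degrees of freedom. Reporting d alongside the test makes both results interpretable.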

Finally, it seems appropriate that item-writing practices should be based
on item-writing theories, such as those suggested by Bormuth (1970) and Hively
(1974). In addition to aiding in the definition of the construct to be
measured (a necessary condition for the desirable construct validity), these
theories also provide the bases for empirical research on the development of
multiple-choice items.

Table 1



Coding System for
Classifying Instructional Statements on
Multiple-Choice Item Writing

General Item-Writing Advice

1. Avoid textbook, verbatim phrasing of items.

2. Avoid trick questions.

3. Avoid opinion-based items.

4. Base each item on an educational objective.

5. Use types of items that elicit higher-level thinking (various authors
give examples and specific advice).

6. Test for important facts and knowledge.

General Advice

7. Avoid items which require overspecific knowledge.

8. Minimize examinee reading by limiting item length.

9. Use good grammar consistently, making sure that the item and the
options agree grammatically.

10. Focus on a single, clearly defined problem in phrasing the question.

11. Consider vocabulary when phrasing the item; keep it appropriate for
the intended audience.

12. Allow sufficient time for the development, review, and revision of
the item.

13. Avoid interdependence of items or avoid allowing one item to cue
another.

14. Format the item either horizontally or vertically.



Item Advice Focusing on Stem Construction 

15. Ensure that the directions in the item stem are clear and that
wording lets the examinee know what is being tested.

16. Avoid window dressing (extraneous materials) in the stem.

17. State the stem in either a question form or a sentence form with
options completing the stem.

18. Use either the best answer or correct answer format.

19. Avoid type-k items, i.e., items that list a series of statements and
then provide combinations of these statements as options.

20. Don't clue the correct response through a grammatical error.

21. Word the stem positively; avoid negatives.

22. Make a good transition from the stem to the options.

23. Include the central idea and most of the text of the item in the
stem.

24. Stems should be left open at the end; don't leave blanks in the
middle of the stem that refer to options.

Item Advice Focusing on Option Construction



General Advice

25. Items with different numbers of options may appear on the same test.

26. Use three, four or five options for an item.

27. Keep a logical order to options; if quantitative, keep options in
ascending or descending order.

28. Keep options independent from one another.

29. Keep the length of options fairly consistent.

30. Avoid the use of "none of the above."

31. Avoid the use of "all of the above."

32. Use the option "I don't know."

33. Keep options homogenous in content and grammatical structure.

34. Phrase options positively, not negatively.

Correct Option

35. Balance the key; that is, make sure the correct option is found an
equal number of times in each option position, if possible.

36. Make sure there is one and only one correct option.

Distracters

37. Incorporate common errors of students in developing distracters;
anticipate what distracter is most likely to attract unprepared
examinees.

38. Avoid illogical distracters; use plausible distracters.

39. Avoid specific determiners (e.g., never, always) in distracters.

40. Avoid distracters that can clue testwise examinees.

41. Avoid technically phrased distracters.

42. Use incorrect paraphrases as distracters.

43. Use familiar-looking but incorrect statements as distracters.

44. Use true statements that do not correctly answer the question as
distracters.

45. Use irrelevant clues for distracters.

Table 2

Frequency of Studies for Each Item Writing Rule

Number  Rule                                                  Frequency

26      Use three, four, or five options for an item.         18*
19      Avoid type-k items.                                    8
29      Keep the length of options fairly consistent.          8
30      Avoid the use of "none of the above."                  6
17      State the stem in either a question form or
        a sentence form.                                       6
35      Balance the key.                                       6
20      Don't clue through grammatical errors.                 4
16      Avoid window dressing in the stem.                     4
24      Don't leave blanks in the middle of the stem.          4
38      Use plausible distracters.                             3
40      Don't use distracters that clue testwise examinees.    3
37      Use common errors of students for distracters.         2
31      Avoid the use of "all of the above."                   2
32      Use the option "I don't know."                         2
        7 other rules had one study each.                      1
        24 other rules had no studies cited.                   0

Table 3

Reliability Coefficients for Items
of Two to Five Alternatives

Study                      Coefficients, in order of increasing
                           number of options

Charles (1926)             .477   .624
Costin (1970)              .560   [illegible]
Costin (1972)              .750
Parker & Seiners (1983)    .532
Ramos & Stern (1973)       .860
Ruch & Charles (1928)      .477   .624
Ruch & Stoddard (1925)     .737   .598
Straton & Catts (1980)     .470   .730   .680
Wakefield (1958)           .860   .890   .920
Williams & Ebel (1957)     .954   .945   .945

Table 4

Effects of Rules Evaluated in Single Studies

                                          Test Characteristics (a)
Author                            Rule    Difficulty  Discrimination  Reliability  Validity

Baker (1971)
Dunn & Goldstein (1959)
Dudycha & Carpenter (1973)        21
Kolstad, Goaz, & Kolstad (1982)   36
Strang (1977)                     41
Strang (1977)                     43
Terranova (1969)                  34

[Most cell entries are illegible. Surviving fragments that cannot be assigned
to a row: rule number 9; Discrimination +/-; Reliability 0.]

Note: Interpretation of symbols

a. Positive effect       +
   Negative effect       -
   Inconclusive effect   0
   Mixed effect          +/-